Abstract

In the Min-Sum 2-Clustering problem, also known as 2-cluster graph editing and
correlation 2-clustering, we are given a graph $G = (V, E)$ and a parameter $k$, and
the goal is to determine whether there exists a 2-partition of $V$ such that the total
conflict number is at most $k$, in which the conflict number of a vertex is the number
of its non-neighbors in the same cluster plus the number of its neighbors in the other
cluster. In this paper we show a parameterized algorithm with time complexity
$O(n \cdot 2.618^{r/(1-4r/n)} + n^3)$, in which $r = k/n$. In particular, the time
complexity is $O^*(2.618^{k/n})$ for $k \in o(n^2)$ and polynomial for
$k \in O(n \log n)$. We also design a parameterized algorithm for a variant that
minimizes the sum of squared conflicts. For $k \in o(n^3)$, the algorithm runs in
$O(n^3 \cdot 5.171^{\theta})$ time, in which $\theta = \sqrt{k/n}$.

Key words: parameterized algorithm, kernelization, cluster graph, clustering,
graph modification.

1 Introduction



Problem definition and motivation. Clustering is an important problem with
applications in numerous fields, and the cluster graph editing problem, also known as
correlation clustering, is a graph-theoretic approach to clustering [2, 22]. A cluster
graph is a graph composed of disjoint maximal cliques. The cluster graph editing
problem asks for a minimum number of edge insertions and deletions that modify the
input graph into a cluster graph. As a variant, the p-cluster graph editing problem
is to modify the input into exactly p disjoint cliques [22]. The cluster graph editing
problem focuses on the editing set, i.e., the edges to be inserted and deleted. As
in the correlation clustering problem, an editing edge can also be thought of as
a disagreement in the clustering. If we change the point of view from edges to
vertices, an editing edge represents a deficiency on its endpoints in the clustering.
Let the conflict number of a vertex be the number of editing edges incident to it.
Then an editing set of cardinality $k$ corresponds one-to-one to a clustering with
total conflict number $2k$. That is, the cluster graph editing problem is equivalent
to finding a vertex partition with minimum total conflict. This reformulation
makes it easier to define other meaningful objective functions on the conflict
numbers, such as the sum of squares.

In this paper we focus on 2-clusterings. For a set $V$, a 2-partition of $V$ is an
unordered pair $\pi = (V_1, V_2)$ such that $V_1 \cap V_2 = \emptyset$ and $V_1 \cup V_2 = V$. In this paper
we assume $V_1, V_2 \ne \emptyset$. We say that $V_1$ and $V_2$ are the two clusters corresponding to
$\pi$. For a 2-partition $\pi$ of $V$, two vertices $u$ and $v$ conflict with each other if they are
in the same cluster but $(u, v) \notin E$, or they are in different clusters but $(u, v) \in E$.
Let $C_\pi(v)$ denote the set of vertices conflicting with $v$ in $\pi$ and let $c_\pi(v) = |C_\pi(v)|$
be the conflict number of $v$. Clearly $u \in C_\pi(v)$ if and only if $v \in C_\pi(u)$. For
an input graph, a 2-clustering problem in general asks for a 2-partition such that
the conflicts are as small as possible. In this paper we study the 2-clustering
problem with two natural objective functions: minimizing the sum of conflicts
and the sum of squared conflicts. Precisely, let $h_1(\pi) = \sum_{v \in V} c_\pi(v)$ and
$h_2(\pi) = \sum_{v \in V} c_\pi^2(v)$. The problems are formally defined as follows. We shall focus
on their decision versions.

Problem: Min-Sum 2-Clustering

Instance: A graph $G = (V, E)$ and a nonnegative integer $k$.
Question: Is there a 2-partition $\pi$ of $V$ such that $h_1(\pi) \le k$?

The second problem, named Min-Square 2-Clustering, is defined similarly,
except that the objective function is $h_2$.

Previous results. Shamir et al. [22] studied the computational complexities of
three edge modification problems. Cluster Editing asks for the minimum total
number of edge insertions and deletions to modify a graph into a cluster graph,
while in Cluster Deletion (respectively, Cluster Completion), only edge
deletions (respectively, insertions) are allowed. They showed that Cluster Editing
is NP-hard, Cluster Deletion is Max SNP-hard, and Cluster Completion
is polynomial-time solvable. They also showed that p-Cluster Deletion
is NP-hard for any $p > 2$ but polynomial-time solvable for $p = 2$, and p-Cluster
Editing is NP-hard for any $p \ge 2$.

An instance of a parameterized problem consists of $(I, k)$, where $k$ is the parameter.
A problem is fixed-parameter tractable (FPT) if the problem can be solved
in time complexity $O(f(k) \cdot p(|I|))$, where $f$ is an arbitrary computable function
of $k$ and $p$ is a polynomial in the input size. For more details about parameterized
complexity, we refer to the book of Downey and Fellows [11]. Kernelization is a
widely used technique for parameterized algorithms. Usually, by some designed
reduction rules, a kernelization algorithm converts an instance $(I, k)$ to a reduced
instance $(I', k')$, called a kernel, such that the answer is not changed, $k' \le k$, and
$|I'|$ is bounded by a computable function of $k$.

The parameterized version of Cluster Editing, and variants of it, were studied
intensively [3l [U El [lOj [161 El EE]- A variant with vertex (rather than edge) 
deletions was considered in [20], and another variant in which overlapping clusters
are allowed was studied in [12].

The correlation clustering problem (on complete signed graphs) was formulated
and studied in [2], in which the authors presented a PTAS for the maximization
version and a constant-factor approximation algorithm for the minimization version.
Ailon et al. [1] showed some approximation results for the minimization version,
including the unweighted case and the weighted case with some interesting
assumptions on weights. Giotis and Guruswami showed that both the minimization and
the maximization versions of the p-Correlation Clustering problem admit a PTAS
for the unweighted case [15]. The maximization 2-Correlation Clustering problem
is also known as Balanced Subgraph, a name that comes from its application
in social network analysis [19, 23].

For p-Cluster Editing, a kernel with $(p + 2)k + p$ vertices was given by Guo
[18]. A variant in which the conflict number of each vertex must be bounded by
a parameter was studied in [21]. The problem of finding a 2-clustering minimizing
the maximum conflict number has been shown NP-hard |§J. Very recently, Fomin 
et al. gave a parameterized algorithm for p-Cluster Editing with time complexity
$O(2^{O(\sqrt{pk})} + n^2)$, where $n$ is the number of vertices [14]. They also showed a lower
bound for the parameterized complexity: there exists $p = \Theta(k^{\alpha})$ for any constant
$0 < \alpha < 1$ such that it is very unlikely that the problem can be solved in time
$2^{o(\sqrt{pk})} \cdot n^{O(1)}$.

Our contributions. Recall that Min-Sum 2-Clustering with parameter $2k$ is
equivalent to 2-Cluster Editing with parameter $k$. We first note that
it is not hard to solve Min-Sum 2-Clustering in polynomial time for $k \in O(n)$.
As an extreme example, if $k < n$, then by the pigeonhole principle there must be a
vertex whose neighborhood in the solution is the same as in the input graph, and thus
the solution can be found by trying $(N_G[v], V(G) \setminus N_G[v])$ for all $v \in V(G)$, where $G$
is the input graph and $N_G[v]$ denotes the closed neighborhood of $v$. This observation
can be easily extended to $k \in O(n)$ and leads to a polynomial-time algorithm.
Therefore, for Min-Sum 2-Clustering, a better result should be expected.
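
The observation for $k < n$ can be illustrated with a short sketch. It assumes vertices labeled 0..n-1 and an edge list as input; the function names are illustrative and not part of the paper. Candidates whose closed neighborhood is all of $V$ are skipped, since both clusters are assumed to be nonempty.

```python
from itertools import combinations

def total_conflict(n, edges, cluster):
    """h1 of the 2-partition (cluster, V minus cluster)."""
    edge_set = {frozenset(e) for e in edges}
    pairs = 0
    for u, v in combinations(range(n), 2):
        same = (u in cluster) == (v in cluster)
        adjacent = frozenset((u, v)) in edge_set
        if same != adjacent:   # non-adjacent inside a cluster, or adjacent across clusters
            pairs += 1
    return 2 * pairs           # each conflicting pair counts at both endpoints

def solve_small_k(n, edges, k):
    """Assumes k < n. Returns a cluster with h1 <= k, or None if none exists."""
    closed = {v: {v} for v in range(n)}        # closed neighborhoods N_G[v]
    for u, v in edges:
        closed[u].add(v)
        closed[v].add(u)
    for v in range(n):
        if len(closed[v]) == n:                # second cluster would be empty
            continue
        if total_conflict(n, edges, closed[v]) <= k:
            return closed[v]
    return None
```

For two triangles joined by a single edge, the closed neighborhood of a vertex not incident to the joining edge already gives the optimal partition with total conflict 2.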

In this paper we develop parameterized algorithms for Min-Sum 2-Clustering
and Min-Square 2-Clustering. We design a kernelization algorithm for Min-Sum
2-Clustering. First, the problem is equivalent to finding a flipping set of
bounded size for an initial vertex 2-partition, i.e., determining which vertices
should be swapped to the other cluster. By a precise analysis, it is shown that the
kernel size is at most $2k/(n - 4f) - f$ if there is a flipping set of size $f$. The
kernelization algorithm iteratively decreases the flipping quota $f$ and flips the vertices
with too many conflicts until the kernel size fits the bound. Then a search-tree
algorithm is employed to determine whether there exists a flipping set of size bounded
by the remaining flipping quota. By analyzing the worst case of the remaining
flipping quota, we show that the problem can be solved in $O(n \cdot 2.618^{r/(1-4r/n)} + n^3)$
time, in which $r = k/n$. In particular, the time complexity is $O^*(2.618^{k/n})$ for
$k \in o(n^2)$ and polynomial for $k \in O(n \log n)$, where $O^*(\cdot)$ is $O(\cdot)$ with polynomial
factors ignored. We also note that this result is better than the $O^*(2^{O(\sqrt{pk})})$ bound recently
obtained by Fomin et al. [14] for the special case of $p = 2$. Even when $k = \delta n^2$ with a
small constant $\delta$, our algorithm improves on the brute-force algorithm significantly.
For example, when $\delta = 0.1$, the time complexity is $O^*(2^{0.231n})$, much better than
$O^*(2^n)$.

The other contribution is an $O(n^3 \cdot 5.171^{\theta})$-time algorithm for Min-Square
2-Clustering, where $\theta = \sqrt{k/n}$ and $k \in o(n^3)$. The result is obtained by a
similar method and some observations on the optimal solutions.

Organization of the paper. In Section 2 we give some notation and definitions, 
as well as some properties, used in this paper. The reduction algorithm is in 
Section 3. In Sections 4 and 5 we show the algorithms for the two problems, 
respectively. Finally some concluding remarks are in Section 6. 

2 Preliminaries 

For two sets $S_1$ and $S_2$, the set difference is denoted by $S_1 \setminus S_2$, and the symmetric
difference is denoted by $S_1 \ominus S_2 = (S_1 \setminus S_2) \cup (S_2 \setminus S_1)$. For simplicity, we write
$S_1 \ominus v = S_1 \ominus \{v\}$ for a single element $v$. For a graph $G$, the vertex set and the edge set
are denoted by $V(G)$ and $E(G)$, respectively. For $S \subseteq V(G)$, the subgraph induced by $S$ is
denoted by $G[S]$. Throughout this paper, $G = (V, E)$ is the input graph and
$n = |V|$. The terms "2-partition" and "conflict" have been defined in the previous
section. For $S \subseteq V$, let $C_\pi(v, S) = C_\pi(v) \cap S$ and $c_\pi(v, S) = |C_\pi(v, S)|$. For disjoint
$S_1$ and $S_2$, $c_\pi(S_1, S_2) = \sum_{v \in S_1} c_\pi(v, S_2)$. Note that $c_\pi(S_1, S_2) = c_\pi(S_2, S_1)$. When
there is no confusion, we shall omit the subscript and simply use $c(\cdot)$ instead of
$c_\pi(\cdot)$.
In the literature, Min-Sum 2-Clustering is also known in another formulation:
2-Cluster Editing. A graph $G = (V, E)$ is a 2-cluster graph if it consists
of exactly two disjoint cliques. That is, there exists a 2-partition $\pi = (V_1, V_2)$ of
$V$ such that both $G[V_1]$ and $G[V_2]$ are maximal cliques. For $G = (V, E)$, a set
$D \subseteq V \times V$ is an editing set (to a 2-cluster graph) for $G$ if $G' = (V, E \ominus D)$ is
a 2-cluster graph. In other words, $G$ can be modified into a 2-cluster graph by
inserting $D \setminus E$ and deleting $D \cap E$. Given a graph $G$ and an integer $k$, the 2-cluster
graph editing problem asks whether there is an editing set $D$ for $G$ such that $|D| \le k$.
Let $E^*(\pi) = \bigcup_{i=1,2} \{(u, v) \mid u, v \in V_i\}$, which is the edge set of the 2-cluster graph.
By definition, if $D$ is an editing set and $\pi$ is the corresponding 2-partition, we
have $E \ominus D = E^*(\pi)$ and $D = \bigcup_{v \in V} \{(v, u) \mid u \in C_\pi(v)\}$. Therefore, finding an
editing set is equivalent to finding the corresponding 2-partition. Furthermore,
$|D| = (1/2) \sum_{v \in V} c_\pi(v)$. Consequently the problem is equivalent to Min-Sum
2-Clustering.
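
The correspondence between conflicts and editing sets can be made concrete on a small instance. The sketch below assumes undirected edges stored as unordered pairs; the helper names are illustrative and not taken from the paper. It computes the conflict sets $C_\pi(v)$, the objectives $h_1$ and $h_2$, and the editing set $D$ for a path on five vertices.

```python
from itertools import combinations

def conflict_sets(vertices, edges, v1):
    """Return {v: C_pi(v)} for the 2-partition pi = (v1, V minus v1)."""
    edge_set = {frozenset(e) for e in edges}
    conf = {v: set() for v in vertices}
    for u, v in combinations(vertices, 2):
        same_cluster = (u in v1) == (v in v1)
        adjacent = frozenset((u, v)) in edge_set
        if same_cluster != adjacent:       # the pair (u, v) is a conflict
            conf[u].add(v)
            conf[v].add(u)
    return conf

def editing_set(vertices, edges, v1):
    """D as a set of unordered pairs: one editing edge per conflicting pair."""
    conf = conflict_sets(vertices, edges, v1)
    return {frozenset((u, v)) for v in vertices for u in conf[v]}

V = range(5)
E = [(0, 1), (1, 2), (2, 3), (3, 4)]       # a path on five vertices
pi = {0, 1, 2}                             # first cluster; the second is {3, 4}
C = conflict_sets(V, E, pi)
h1 = sum(len(C[v]) for v in V)             # sum of conflicts
h2 = sum(len(C[v]) ** 2 for v in V)        # sum of squared conflicts
D = editing_set(V, E, pi)                  # insert (0, 2), delete (2, 3)
```

Here $|D| = 2$ and $h_1 = 4$, in line with $|D| = (1/2) \sum_v c_\pi(v)$, and the symmetric difference of $E$ and $D$ is a triangle plus a single edge.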

To flip a vertex $v$ in a 2-partition $\pi$ is to move $v$ to the other cluster, that is,
we change $\pi$ to $\pi \ominus v = (V_1 \ominus v, V_2 \ominus v)$. Flipping a vertex subset $S$ changes $(V_1, V_2)$
to $(V_1 \ominus S, V_2 \ominus S)$. If $\pi' = \pi \ominus v$, then $C_{\pi'}(v) = V \setminus \{v\} \setminus C_\pi(v)$. That is, flipping
a vertex exchanges its conflicts and non-conflicts. Furthermore, for a flipping set
$F$, only those conflicts in $F \times (V \setminus F)$ change. If $\pi' = \pi \ominus F$, then

$$C_{\pi'}(v) = \begin{cases} C_\pi(v, F) \cup (\bar{F} \setminus C_\pi(v, \bar{F})) & \text{if } v \in F \\ C_\pi(v, \bar{F}) \cup (F \setminus C_\pi(v, F)) & \text{if } v \in \bar{F} \end{cases} \qquad (1)$$

in which $\bar{F} = V \setminus F$.

It is convenient to represent the conflicts as a conflict graph. The conflict graph
of $\pi$ is a simple undirected graph $H$ with $V(H) = V$ and $E(H) = \bigcup_{v \in V} \{(v, u) \mid u \in C_\pi(v)\}$.
If $H'$ is the conflict graph of $\pi' = \pi \ominus F$, we have

$$E(H') = E(H[F]) \cup E(H[\bar{F}]) \cup \{(u, v) \mid u \in F, v \in \bar{F}, (u, v) \notin E(H)\}. \qquad (2)$$

The profit of a flipping set $F$, denoted by $\Delta(F)$, is the decrement of total conflict
after flipping $F$, i.e., $\Delta(F) = h_1(\pi) - h_1(\pi \ominus F) = \sum_v (c_\pi(v) - c_{\pi \ominus F}(v))$.

Lemma 1: $\Delta(F) = 4 c_\pi(F, \bar{F}) - 2|F||\bar{F}|$.
Corollary 2: $\Delta(\{v\}) = 4 c_\pi(v) - 2(n - 1)$.
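
Lemma 1 can be sanity-checked numerically by comparing the profit computed directly from the definition with the closed form. The sketch below assumes vertices 0..n-1 and represents a 2-partition by its first cluster; all function names are illustrative.

```python
import random
from itertools import combinations

def conflicts(u, v, edge_set, v1):
    """True when u and v conflict in the 2-partition (v1, V minus v1)."""
    return ((u in v1) == (v in v1)) != (frozenset((u, v)) in edge_set)

def h1(n, edge_set, v1):
    # Each conflicting pair is counted once at each endpoint.
    return 2 * sum(conflicts(u, v, edge_set, v1)
                   for u, v in combinations(range(n), 2))

def profit(n, edge_set, v1, flip):
    """Delta(F) from the definition; v1 ^ flip is the first cluster after flipping F."""
    return h1(n, edge_set, v1) - h1(n, edge_set, v1 ^ flip)

def lemma1_rhs(n, edge_set, v1, flip):
    """4 * c(F, complement of F) - 2 * |F| * |complement of F|."""
    cross = sum(conflicts(u, v, edge_set, v1)
                for u in flip for v in range(n) if v not in flip)
    return 4 * cross - 2 * len(flip) * (n - len(flip))

def check_random_instances(trials=200, seed=0):
    """Compare both sides of Lemma 1 on random graphs, partitions and flipping sets."""
    rng = random.Random(seed)
    for _ in range(trials):
        n = rng.randint(1, 8)
        edge_set = {frozenset(p) for p in combinations(range(n), 2)
                    if rng.random() < 0.5}
        v1 = {v for v in range(n) if rng.random() < 0.5}
        flip = {v for v in range(n) if rng.random() < 0.3}
        if profit(n, edge_set, v1, flip) != lemma1_rhs(n, edge_set, v1, flip):
            return False
    return True
```

Taking a flipping set with a single vertex reproduces Corollary 2: for a triangle placed in one cluster, flipping any vertex has profit $4 \cdot 0 - 2 \cdot 2 = -4$.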

3 Reduction algorithm 

In this section, we propose a preprocessing algorithm which reduces the instance 
size. Consider the following problem. 

Instance: A graph $G = (V, E)$, a vertex $s \in V$, and nonnegative
integers $K$, $f$, and $t$.

Question: Is there a 2-partition $\pi$ of $V$ such that $\sum_v c_\pi(v) \le K$,
$c_\pi(s) \le f$, and $c_\pi(v) \le t$ for any other vertex $v \in V$?

Definition 1: Let $\pi$ be a 2-partition of $V$ and $K$, $t$, $f$ be nonnegative integers. A
vertex subset $F$ is a $(K, t, f)$-feasible flipping set for $\pi$ if $|F| \le f$, $\sum_v c_{\pi'}(v) \le K$,
and $c_{\pi'}(v) \le t$ for any other vertex $v \in V$, where $\pi' = \pi \ominus F$.

We assume $t + 2f < \epsilon n$ for some constant $\epsilon$. Let $\pi_s = (N_G[s], V \setminus N_G[s])$, in which
$N_G[s]$ is the closed neighborhood of $s$ in $G$. The above problem is equivalent to
determining a $(K, t, f)$-feasible flipping set for $\pi_s$.

The reduction algorithm outputs a 2-partition $\pi$, a vertex subset $U \subseteq V$ and
an integer $m \le f$ such that there exists a $(K, t, f)$-feasible flipping set in $V$ for
$\pi_s$ if and only if there exists a $(K, t, m)$-feasible flipping set in $U$ for $\pi$. We shall
call $f$ the flipping quota. To avoid confusion, we use different symbols $f$ and $m$
for the flipping quotas before and after the reduction algorithm, respectively. The
algorithm is based on the following reduction rules for any 2-partition $\pi$ and integer
$f$.

R1: For any vertex $v$, if $c_\pi(v) > t + f$, $v$ must be flipped.

R2: For any vertex $v$, if $c_\pi(v) < n - t - f$, $v$ cannot be flipped.

R3: If $f = 0$ and there exists $v$ with $c_\pi(v) > t + f$, there is no feasible flipping
set.

R4: If $c_\pi(v) \le t + f$ for all $v \in V$ and $|U| > K/(n - t - 2f) - f$, there is no
feasible flipping set of size $f$, in which $U = \{v \mid c_\pi(v) \ge n - t - f\}$.

In the following we shall show the correctness of the reduction rules. We start
from the first three simple rules.

Lemma 3: The reduction rules R1-R3 are correct. 

By R2, we can put those vertices with $c_\pi(v) < n - t - f$ into a set $X$, and when
none of the first three rules can be applied, $U = V \setminus X$ is the set of undetermined
vertices. We note that, once a vertex is flipped, it should be put into $X$. In
fact, we do not even need to divide $V$ into $X$ and $U$ until R1 is no longer applicable.
The rule R4 is an upper bound on $|U|$. We now show its correctness. Suppose that
$F \subseteq U$ is a feasible flipping set. Let $\bar{F} = V \setminus F$. In the remaining paragraphs of
this section we omit the subscript $\pi$.

We start from a simple but weaker bound. Since $c(v) \le t + f$, by Lemma 1 the
profit of $F$ is $\Delta(F) = 4c(F, \bar{F}) - 2|F||\bar{F}| \le 4f(t + f) - 2f(n - f) = (4t - 2n)f + 6f^2$.
Immediately, if $\sum_v c(v) > K + (4t - 2n)f + 6f^2$, there does not exist any feasible
flipping set. Since $c(v) \ge n - t - f$ for any $v \in U$,

$$(n - t - f)|U| \le \sum_{v \in U} c(v) \le \sum_{v \in V} c(v) \le K + (4t - 2n)f + 6f^2,$$

and therefore

$$|U| \le \frac{K + (4t - 2n)f + 6f^2}{n - t - f}. \qquad (3)$$

We shall show the better bound in R4 by a more precise analysis.

Lemma 4: If there exists a feasible flipping set of size $f$, then
$|U| \le K/(n - t - 2f) - f$.

We note that the upper bound is for the case of a flipping set of size exactly
$f$. As $f$ decreases, the upper bound increases but $|U|$ decreases because possibly
more vertices are flipped. Since our goal is to find a flipping set of size at most $f$,
the algorithm iteratively decreases $f$ until the bound is satisfied. The reduction
algorithm is in Algorithm 1.

Lemma 5: The algorithm Reduction takes $O(n^2)$ time.

4 Minimum sum of conflicts 

In this section we show a parameterized algorithm for Min-Sum 2-Clustering. 
We start from two simple properties. 

Fact 6: If $\pi$ is a 2-partition with minimum $h_1(\pi)$, then $c_\pi(v) \le (n - 1)/2$ for any
$v$.

The next property comes from the definitions. 

Fact 7: For any 2-partition $\pi$, there exists a vertex $v$ with conflict number at
most $h_1(\pi)/n$.

Algorithm 1: Reduction($G$, $s$, $K$, $t$, $f$)

Input: a graph $G = (V, E)$, a vertex $s \in V$, and integers $K$, $t$ and $f$.
Output: a 2-partition $\pi$, a vertex subset $U$, and an integer $m$.

1: initially $\pi = (N_G[s], V \setminus N_G[s])$;
2: while $\exists v$ such that $c_\pi(v) > t + f$ do
3:   $\pi \leftarrow \pi \ominus v$; // flipping $v$
4:   update $c_\pi(u)$ for each $u$;
5:   $f \leftarrow f - 1$;
6:   if $f < 0$ then
7:     report "No" and terminate; // R3, no solution
8:   end if
9: end while
10: let $U = \{v \mid c_\pi(v) \ge n - t - f\}$;
11: if $|U| > K/(n - t - 2f) - f$ then
12:   $f \leftarrow f - 1$;
13:   if $f < 0$ then report "No" and terminate;
14:   goto Step 2;
15: end if
16: return $(\pi, U, m = f)$;
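
Algorithm 1 can be sketched in Python as follows. This is an illustration, not the authors' implementation. It assumes vertices 0..n-1, assumes that each flip forced by R1 consumes one unit of the flipping quota, and recomputes all conflict numbers in every round for clarity, so it does not achieve the running time of Lemma 5. R4 is skipped when its denominator is not positive, a case the assumption $t + 2f < \epsilon n$ is taken to exclude.

```python
from itertools import combinations

def conflict_numbers(n, adj, side):
    """c_pi(v) for every v; side[v] is True when v is in the first cluster."""
    c = [0] * n
    for u, v in combinations(range(n), 2):
        if (side[u] == side[v]) != (v in adj[u]):
            c[u] += 1
            c[v] += 1
    return c

def reduction(n, edges, s, K, t, f):
    """Returns (side, U, m), or None when the answer is "No"."""
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # Initial partition pi_s: closed neighborhood of s against the rest.
    side = [v == s or v in adj[s] for v in range(n)]
    while True:
        c = conflict_numbers(n, adj, side)
        forced = next((v for v in range(n) if c[v] > t + f), None)
        if forced is not None:             # R1: this vertex must be flipped
            side[forced] = not side[forced]
            f -= 1
            if f < 0:                      # R3: the quota is exhausted
                return None
            continue
        # R2: only vertices with many conflicts remain undetermined.
        U = [v for v in range(n) if c[v] >= n - t - f]
        denom = n - t - 2 * f
        if denom > 0 and len(U) > K / denom - f:   # R4: kernel too large for this quota
            f -= 1
            if f < 0:
                return None
            continue
        return side, U, f
```

For two 4-cliques joined by the edge (0, 4) and $s = 0$, the initial partition misplaces vertex 4, which R1 flips back at the cost of one unit of quota.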



By the above two facts, we can focus on determining a 2-partition $\pi$ such that
$h_1(\pi) = \sum_v c_\pi(v) \le k$, $c_\pi(s) \le k/n$ for some $s \in V$, and $c_\pi(v) \le (n - 1)/2$ for each
$v$. Hence we can use the reduction algorithm with $K = k$, $t = (n - 1)/2$ and
$f = k/n$. The remaining work is to check, for every possible flipping set $F$ of size
at most $m$, whether $h_1(\pi \ominus F) \le k$, in which $m$ is the flipping quota returned by the
reduction algorithm. This work can be done by a simple search-tree algorithm on the reduced
instance. The search-tree algorithm picks an arbitrary undetermined vertex $v$
and recursively solves the problem for two cases: flipping $v$ or not. To make the
reduced instance a kernel, the time complexity of the search-tree algorithm should
depend on the kernel size rather than the input size. Fortunately, for Min-Sum
2-Clustering, this can be easily done by the following method.

Let $U$ be the set of undetermined vertices and $X = V \setminus U$. For each $u \in U$,
we record $c_\pi(u, X)$. We also record $x = c_\pi(X, X)$ and $|X|$. When moving a
vertex $u$ from $U$ to $X$, we update $x$ to $x + 2c_\pi(u, X)$ and also update $c_\pi(v, X)$ to
$c_\pi(v, X) + c_\pi(u, v)$ for each $v \in U \setminus \{u\}$. When flipping and removing $u$, we update
$x$ to $x + 2(|X| - c_\pi(u, X))$ and also update $c_\pi(v, X)$ to $c_\pi(v, X) + 1 - c_\pi(u, v)$ for
each $v \in U \setminus \{u\}$. This work can be done in $O(|U|)$ time if the conflict graph
induced by $U$ is used as the input of the search-tree algorithm. Since the number
of conflict edges is $O(|U|^2)$, each recursive call takes $O(|U|^2)$ time.
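
The bookkeeping can be sketched as a recursive search. The code below is an illustration under simplifying assumptions: vertices are 0..n-1, `side[v]` is True for the first cluster of the reduced partition, and the per-branch dictionaries are copied rather than updated in place. The name `search_tree` and its signature are not taken from the paper.

```python
from itertools import combinations

def search_tree(n, edges, side, U, m, k):
    """Return a set F within U with |F| <= m and h1 <= k after flipping F, else None."""
    edge_set = {frozenset(e) for e in edges}

    def conflict(u, v):
        # 1 if u and v conflict in the partition given by `side`, else 0.
        return int((side[u] == side[v]) != (frozenset((u, v)) in edge_set))

    in_U = set(U)
    X = [v for v in range(n) if v not in in_U]
    x0 = 2 * sum(conflict(u, v) for u, v in combinations(X, 2))   # c(X, X)
    cx0 = {u: sum(conflict(u, w) for w in X) for u in U}          # c(u, X)

    def rec(i, quota, x, cx, size_x, flipped):
        if i == len(U):                    # every vertex is determined; x equals h1
            return set(flipped) if x <= k else None
        u, rest = U[i], U[i + 1:]
        # Branch 1: u keeps its cluster and moves to X.
        keep = {v: cx[v] + conflict(u, v) for v in rest}
        found = rec(i + 1, quota, x + 2 * cx[u], keep, size_x + 1, flipped)
        if found is not None or quota == 0:
            return found
        # Branch 2: u is flipped and then moves to X.
        flip = {v: cx[v] + 1 - conflict(u, v) for v in rest}
        return rec(i + 1, quota - 1, x + 2 * (size_x - cx[u]), flip,
                   size_x + 1, flipped + [u])

    return rec(0, m, x0, cx0, len(X), [])
```

On the graph with edges (0, 1) and (2, 3) and clusters {0, 2} and {1, 3}, flipping vertices 1 and 2 removes all conflicts, while a quota of one flip cannot reach total conflict 0.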

The whole algorithm is in Algorithm 2. Note that we need to try every vertex
$v$ as the input of Reduction.

In the remaining paragraphs of this section we show the time complexity of
Algorithm 2. The most important part is the time complexity of the search-tree
algorithm.

Lemma 8: The search-tree algorithm runs in $O(\phi^{|U|+m})$ time, in which
$\phi = (1 + \sqrt{5})/2 \approx 1.618$.

Algorithm 2: Finding min-sum 2-clustering

Input: a graph $G = (V, E)$ and integer $k$.
Output: a 2-partition $\pi$ with $h_1(\pi) \le k$ or output "No".

1: for each $v \in V$ do
2:   call Reduction($G$, $v$, $k$, $(n - 1)/2$, $k/n$) to compute $(\pi, U, m)$;
3:   if Reduction reports "No" then
4:     goto step 11;
5:   end if
6:   construct the conflict graph $H$ induced by $U$;
7:   call search-tree algorithm with $m$, $k$, and $H$ as input;
8:   if a feasible solution is found then
9:     output the answer and terminate;
10:  end if
11: end for
12: report "No";
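
The bound of Lemma 8 can be checked on small cases. The sketch assumes the natural recurrence for a tree that branches on "keep" or "flip" for each of $u$ undetermined vertices with quota $m$, namely $L(u, m) = L(u-1, m) + L(u-1, m-1)$ with $L(0, m) = L(u, 0) = 1$; the function name is illustrative. Since $\phi^2 = \phi + 1$, induction on this recurrence gives $L(u, m) \le \phi^{u+m}$.

```python
from functools import lru_cache
from math import comb, sqrt

PHI = (1 + sqrt(5)) / 2

@lru_cache(maxsize=None)
def leaves(u, m):
    """Leaves of a search tree over u undetermined vertices with flipping quota m."""
    if u == 0 or m == 0:
        return 1                     # nothing left to branch on
    # Keep the next vertex (quota unchanged) or flip it (quota decreases by one).
    return leaves(u - 1, m) + leaves(u - 1, m - 1)
```

The closed form is $\sum_{i=0}^{\min(u, m)} \binom{u}{i}$; for instance `leaves(10, 3)` is 176, while $\phi^{13}$ is about 521.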

Theorem 9: Min-Sum 2-Clustering can be solved in $O(n \cdot 2.618^{r/(1-4r/n)} + n^3)$
time, in which $r = k/n$.

When $k \in o(n^2)$, since $m \le k/n$, we have $(4m/n)(k/n) < \epsilon k/n$ for any constant $\epsilon > 0$
when $n$ is sufficiently large. Then

$$|U| \le \frac{k}{n/2 - 2m} - m = \frac{2k/n}{1 - 4m/n} - m < (2 + \epsilon)(k/n) - m. \qquad (4)$$

Since $\epsilon$ can be arbitrarily small, $\phi^{2+\epsilon} \approx \phi^2 \approx 2.618$.

Corollary 10: When $k \in o(n^2)$, Min-Sum 2-Clustering can be solved in
$O(n \cdot 2.618^{k/n} + n^3)$ time.

Corollary 11: Min-Sum 2-Clustering can be solved in polynomial time if
$k \in O(n \log n)$.

When $k \in \Theta(n^2)$, the time complexity can be expressed as follows, in which
the condition $\delta < 0.186$ ensures that the result is better than the naive $O^*(2^n)$-time
algorithm.

Corollary 12: When $k = \delta n^2$ with $\delta < 0.186$, Min-Sum 2-Clustering can be
solved in $O(n \cdot 2.618^{\delta n/(1-4\delta)} + n^3)$ time.
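
The constants in Corollary 12 and in the example with $\delta = 0.1$ can be reproduced with a few lines of arithmetic. The sketch rewrites $2.618^{\delta n/(1-4\delta)}$ as $2^{cn}$ and solves $c = 1$ for the break-even value of $\delta$; the function names are illustrative.

```python
from math import log2, sqrt

PHI_SQUARED = ((1 + sqrt(5)) / 2) ** 2      # about 2.618

def min_sum_exponent(delta):
    """c with 2.618^(delta*n/(1-4*delta)) = 2^(c*n); requires 0 <= delta < 1/4."""
    return log2(PHI_SQUARED) * delta / (1 - 4 * delta)

def break_even_delta():
    """delta at which c reaches 1, where the bound meets the naive 2^n."""
    a = log2(PHI_SQUARED)
    # Solve a * delta / (1 - 4 * delta) = 1 for delta.
    return 1 / (a + 4)
```

With these definitions, `min_sum_exponent(0.1)` is about 0.231 and `break_even_delta()` is about 0.186, matching the values quoted in the text.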



5 Minimizing the sum-of-squares 

Recall that $h_2(\pi) = \sum_v c_\pi^2(v)$ is the sum of squared conflicts for a 2-partition $\pi$.
Given a graph $G$ and an integer $k$, Min-Square 2-Clustering determines if there
exists a 2-partition $\pi$ with $h_2(\pi) \le k$. In this section, we show a parameterized
algorithm with parameter $k$.

Lemma 13: If $h_2(\pi) \le k$, then $\sum_v c_\pi(v) \le \sqrt{nk}$ and there exists a vertex $v$ with
$c_\pi(v) \le \sqrt{k/n}$.
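
One standard way to see Lemma 13 is the Cauchy-Schwarz inequality followed by an averaging step. The derivation below is a sketch of that route and is not necessarily the argument the authors had in mind.

```latex
\Bigl(\sum_{v \in V} c_\pi(v)\Bigr)^2
  \le n \sum_{v \in V} c_\pi^2(v) = n\, h_2(\pi) \le nk
\quad\Longrightarrow\quad
\sum_{v \in V} c_\pi(v) \le \sqrt{nk},
\qquad
\min_{v \in V} c_\pi(v) \le \frac{1}{n} \sum_{v \in V} c_\pi(v)
  \le \frac{\sqrt{nk}}{n} = \sqrt{k/n}.
```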

Lemma 14: If $\pi$ is a 2-partition with minimum $h_2(\pi)$, then $c_\pi(v) \le \sqrt{n(n - 1)/2}$
for any vertex $v$.

The parameterized algorithm for Min-Square 2-Clustering is similar to the
one in the previous section. By the above two lemmas, we try every vertex as the
one with conflict number at most $f = \sqrt{k/n}$, and for each iteration use Reduction
with total conflict $K = \sqrt{nk}$, individual bound $t = \sqrt{n(n - 1)/2}$ and flipping
quota $f = \sqrt{k/n}$. For the reduced instance $U$ and $m$, a search-tree algorithm is
employed to check all possible flipping sets. Unfortunately, we did not find
a way to compute the objective function $h_2$ with time complexity depending only
on $|U|$. Therefore it takes $O(n^2)$ time to compute $h_2$ for each flipping set.

Theorem 15: For $k \in o(n^3)$, Min-Square 2-Clustering can be solved in
$O(n^3 \cdot \phi^{3.414\theta}) \approx O(n^3 \cdot 5.171^{\theta})$ time, in which $\theta = \sqrt{k/n}$.

Corollary 16: For $k \in O(n \log^2 n)$, Min-Square 2-Clustering can be solved
in polynomial time.
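
The constant 3.414 in Theorem 15 can be traced back to the kernel bound of rule R4. The following derivation is a sketch under the assumptions that $K = \sqrt{nk} = n\theta$, $t = \sqrt{n(n-1)/2} < n/\sqrt{2}$, the quota returned by Reduction satisfies $m \le \theta$, and $\theta \in o(n)$, which holds when $k \in o(n^3)$. Lower-order terms are dropped in the limit.

```latex
|U| + m \le \frac{K}{n - t - 2m}
        \le \frac{n\theta}{n\,(1 - 1/\sqrt{2}) - 2\theta}
        = \frac{\theta}{(1 - 1/\sqrt{2}) - 2\theta/n}
        \;\longrightarrow\; \frac{\theta}{1 - 1/\sqrt{2}}
        = (2 + \sqrt{2})\,\theta \approx 3.414\,\theta,
\qquad
\phi^{\,2 + \sqrt{2}} \approx 5.17 .
```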

Similar to Min-Sum 2-Clustering, we can also derive the time complexity
for $k = \Theta(n^3)$. But it is better than the brute-force algorithm only when $k/n^3$ is
very small.

Corollary 17: For $k = \delta^2 n^3$, Min-Square 2-Clustering can be solved in
$O(n^3 \cdot \rho^n)$ time, in which $\rho = \phi^{(2+\sqrt{2})\delta/(1-(4+2\sqrt{2})\delta)}$.

6 Concluding remarks 

In this paper, we show parameterized algorithms for 2-clusterings with minimum 
conflict and minimum sum of squared conflicts. The minimum conflict problem 
is equivalent to the 2-cluster graph editing problem in the literature. As in many 
other optimization problems, the sum-of-squares objective function usually pro- 
vides solutions with both small total amount and small individual amounts. The 
proposed algorithms significantly improve on the brute-force algorithm when $k$ is
relatively small, i.e., $k \in o(n^2)$ for Min-Sum 2-Clustering and $k \in o(n^3)$ for
Min-Square 2-Clustering.

The time complexity in Lemma 8 is shown by induction, which is then used to
derive the time complexities of the two algorithms. One may wonder whether the bound
is overestimated. In the appendix, we show that the bound is tight when $k$ is relatively
small.

A straightforward interesting question is how to generalize the algorithms to 
the case of more than two clusters. However, the problem seems to become much 
more difficult when the number of clusters is more than two. There are also some 
other interesting objective functions, such as min-max conflict, minimum weighted 
sum. Another type of interesting problems is to relax the definition of cluster, such 
as minimum editing to $k$-plex, minimum editing to cluster graph with bounded
conflict.

Appendix

In Lemma 8 we show that the time complexity of the search-tree algorithm is bounded
by $O(\phi^{|U|+m})$. Here we show that the bound is tight for small $k$, i.e., for $k \in o(n^2)$
in Min-Sum 2-Clustering (Corollary 10) and $k \in o(n^3)$ in Min-Square
2-Clustering (Theorem 15).

As shown in the proof of Theorem 15, the number of leaf nodes of the search
tree is $\sum_{i=0}^{m} \binom{|U|}{i}$. It is sufficient to show that

$$\max_m \binom{|U|}{m} \ge \phi^{|U|+m}. \qquad (5)$$

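Lemma 18: Suppose that $\beta$ is a decreasing function of $m$. Let $\alpha = m/\beta$. The
maximum of $\binom{\beta}{m}$ occurs at $m = m_0$, in which $m_0$ is the solution of

$$-\beta' \cdot \log(1 - \alpha) - \log\frac{\alpha}{1 - \alpha} = 0. \qquad (6)$$

Proof: By Stirling's approximation,

$$\log\binom{\beta}{m} \approx \beta \cdot \lambda(\alpha), \qquad (7)$$

in which $\lambda(e) = -e \log(e) - (1 - e) \log(1 - e)$ is known as the binary entropy of $e$.
In the following we use $\lambda'$, $\beta'$ and $\alpha'$ for the derivatives with respect to $m$.

$$\lambda' = -\Bigl(\alpha' \log\alpha + \alpha \cdot \frac{\alpha'}{\alpha \ln 2}\Bigr) - \Bigl(-\alpha' \log(1 - \alpha) + (1 - \alpha) \cdot \frac{-\alpha'}{(1 - \alpha) \ln 2}\Bigr)$$
$$= -\bigl(\alpha' \log\alpha - \alpha' \log(1 - \alpha)\bigr) = -\alpha' \log\frac{\alpha}{1 - \alpha}.$$

Since $\alpha' = (\beta - m\beta')/\beta^2$,

$$\beta \cdot \lambda' = -\Bigl(1 - \frac{m\beta'}{\beta}\Bigr) \log\frac{\alpha}{1 - \alpha} = -\log\frac{\alpha}{1 - \alpha} + \alpha\beta' \cdot (\log\alpha - \log(1 - \alpha)).$$

Then,

$$\frac{d}{dm} \log\binom{\beta}{m} = \beta' \cdot \lambda + \beta \cdot \lambda'$$
$$= -\beta' \bigl(\alpha \log\alpha + (1 - \alpha) \log(1 - \alpha)\bigr) - \log\frac{\alpha}{1 - \alpha} + \alpha\beta' \cdot (\log\alpha - \log(1 - \alpha))$$
$$= -\beta' \log(1 - \alpha) - \log\frac{\alpha}{1 - \alpha}.$$

Note that $\alpha = m/\beta$ is an increasing function of $m$. By the assumption
$\beta' < 0$, the derivative is positive for sufficiently small $\alpha$. When $m$ goes from small
to large, the derivative becomes negative when $\alpha$ is sufficiently large. Therefore,
$\max_m \binom{\beta}{m}$ occurs at the value $m_0$ which makes the derivative zero, i.e., the solution
of equation (6). $\Box$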
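
Lemma 18 can be checked numerically in the special case $\beta(m) = B - m$ for a fixed total $B$, so that $\beta' = -1$. Under this assumption, equation (6) reduces to $(1 - \alpha)^2 = \alpha$, whose root in $(0, 1)$ is $(3 - \sqrt{5})/2 \approx 0.382$. The sketch below, with illustrative function names, locates the maximizing $m$ by exhaustive search and compares the growth rate of the maximum with $\log_2 \phi$.

```python
from math import comb, log2, sqrt

PHI = (1 + sqrt(5)) / 2
ALPHA0 = (3 - sqrt(5)) / 2          # root of (1 - a)^2 = a in (0, 1)

def best_quota(total):
    """m maximising C(total - m, m), i.e. beta(m) = total - m with beta' = -1."""
    return max(range(total // 2 + 1), key=lambda m: comb(total - m, m))

def growth_rate(total):
    """log2 of the maximum binomial coefficient, divided by total = beta + m."""
    m0 = best_quota(total)
    return log2(comb(total - m0, m0)) / total
```

For a total of 2000, the ratio $m_0/\beta_0$ found by `best_quota` is close to `ALPHA0`, and `growth_rate` is close to $\log_2 \phi \approx 0.694$; the remaining gap comes from the polynomial factor hidden in Stirling's approximation.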

Let $\beta = 2k/n - m$. First we note that $\max_m \binom{(2+\epsilon)k/n - m}{m} \approx \max_m \binom{\beta}{m}$, and
we show (5) by a proof of $\max_m \binom{\beta}{m} \ge \phi^{2k/n}$.

Apparently $\beta'(m) = -1$ and $\beta$ is a decreasing function. By Lemma 18 the
maximum occurs at $m_0$, which is the solution of (6). Let $\alpha_0 = \alpha(m_0)$ and
$\beta_0 = \beta(m_0)$. Since $\beta' = -1$, we have

$$\log(1 - \alpha_0) = \log\frac{\alpha_0}{1 - \alpha_0}, \qquad (8)$$

or equivalently $(1 - \alpha_0)^2 = \alpha_0$. We have that $\alpha_0 = (3 - \sqrt{5})/2$. By definition,
$\lambda(\alpha_0) = -\alpha_0 \log\alpha_0 - (1 - \alpha_0) \log(1 - \alpha_0)$. Then, since $\alpha = m/\beta$,

$$\beta_0 \lambda(\alpha_0) = -\beta_0 \alpha_0 \log\alpha_0 - \beta_0 (1 - \alpha_0) \log(1 - \alpha_0)$$
$$= -m_0 \log\alpha_0 - (\beta_0 - m_0) \log(1 - \alpha_0)$$
$$= -m_0 \log\frac{\alpha_0}{1 - \alpha_0} - \beta_0 \log(1 - \alpha_0)$$
$$= -m_0 \log(1 - \alpha_0) - \beta_0 \log(1 - \alpha_0),$$

in which the last step comes from (8). Since $-\log(1 - \alpha_0) = \log\frac{2}{\sqrt{5} - 1} = \log\phi$,
we obtain $\beta_0 \lambda(\alpha_0) = (m_0 + \beta_0) \log\phi$. By (7) and $\beta_0 + m_0$